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ABSTRACT 

Motivation: Tlie increasing availability of metabolomics data 
enables to better understand the metabolic processes involved in 
the immediate response of an organism to environmental changes 
and stress. The data usually come in the form of a list of metabolites 
whose concentrations significantly changed under some conditions, 
and are thus not easy to interpret without being able to precisely 
visualize how such metabolites are interconnected. 
Results: We present a method that enables to organize the data 
from any metabolomics experiment into metabolic stories. Each 
story corresponds to a possible scenario explaining the flow of 
matter between the metabolites of interest. These scenarios may 
then be ranked in different ways depending on which interpretation 
one wishes to emphasize for the causal link between two affected 
metabolites: enzyme activation, enzyme inhibition or domino effect 
on the concentration changes of substrates and products. Equally 
probable stories under any selected ranking scheme can be further 
grouped into a single anthology that summarizes, in a unique subnet- 
work, all equivalently plausible alternative stories. An anthology is 
simply a union of such stories. We detail an application of the 
method to the response of yeast to cadmium exposure. We use this 
system as a proof of concept for our method, and we show that we 
are able to find a story that reproduces very well the current know- 
ledge about the yeast response to cadmium. We further show that this 
response is mostly based on enzyme activation. We also provide a 
framework for exploring the alternative pathways or side effects this 
local response is expected to have in the rest of the network. We 
discuss several interpretations for the changes we see, and we sug- 
gest hypotheses that could in principle be experimentally tested. 
Noticeably, our method requires simple input data and could be 
used in a wide variety of applications. 

Availability and implementation: The code for the method 
presented in this article is available at http://gobbolino.gforge.inria.fr. 
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1 INTRODUCTION 

One of the main goals of metabolic studies is to understand the 
metabolic processes involved in the adaptation to an environ- 
mental change. Recently, metabolomic techniques gained the 
spotlight by providing a way to monitor metabolism by measur- 
ing the concentration of metaboUtes in different conditions or at 
different time points. A typical result from such an experiment is 
a list of metabolites whose concentrations significantly changed 
when the cell or organisin was exposed to some stress. How to 
interpret this list became then a new research topic, consisting in 
identifying the metabolic processes that link the metabolites 
of interest, possibly explaining the observed variations in their 
concentrations. This topic goes in the literature by the name 
of 'metabolite set enrichment analysis', and is an extension to 
metabolism of work that was initiated in the context of transcrip- 
tomics and then proteomics under the name of 'gene set enrich- 
ment analysis' [see (Subramanian et ah, 2005) for what is 
possibly the first work on this and (Khatri et ah, 2012) for a 
recent survey]. The simplest idea one may think of is to highlight 
the set of metabolites identified in the experiment that have 
significantly changed their concentration, let us call them discri- 
minating compounds, and then to visually analyze their intercon- 
nections. This is what is done notably in Xia et al. (2012). 
However, like a number of other approaches on metabolite set 
enrichment analysis, the projection of enriched metabolites is 
done on pathways instead of the whole network, thereby missing 
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(alternative) pathways not annotated in current databases, or 
more generally paths traversing several pathways. 

For genome-scale networks, the metabolism of a whole organ- 
ism is considered, which may be large (Thiele and Palsson, 2010), 
whereas a metabolic perturbation caused by some stress condi- 
tion may impact only a small portion of this complex network. 
Even if it is sometimes possible to visually identify the pathways 
that explain some of the variations in the monitored metabolites, 
getting an overall explanation for all the observed variations 
usually cannot be performed by visual inspection. 

Recently, automatic methods have been proposed to deal with 
this kind of data (Antonov et al., 2009; Dittrich et al., 2008; 
Faust et al., 2010; Leader et al., 2011). A natural idea is to try 
to link all discriminating compounds through chains of reac- 
tions. One possible model for this is by means of a Steiner 
tree, which is a minimum cost tree that connects all nodes 
belonging to a predefined subset called terminals, which in the 
case of metabolism would be the discriminating compounds 
(Dittrich et al., 2008; Scott et al., 2005). However, any pair of 
metabolites may be connected through several alternative paths 
within a network, and each of these paths may validly explain the 
observed changes of concentration. In this context, the extraction 
of subgraphs appears to be more relevant than the extraction of 
subtrees. The number of alternative paths between two metabol- 
ites may, however, be large and restricting the search to all 
the shortest or lightest (the weight is given by the sum of the 
out-degrees of the vertices in the path) paths between pairs of 
metabolites seems to be a realistic compromise. 

This is the approach followed by (Croes et al., 2006; Faust 
et al., 2010; van Helden et al., 2002) where the authors concen- 
trate on a pair of discriminating compounds and search for sub- 
graphs corresponding to source-to-sink paths between them. In 
Antonov et al. (2009), this approach is pushed one step further as 
the authors consider all pairs of metabolites and unify all the 
shortest paths, this time with a maximum length k. In practice, 
this may lead to large networks (ii k is too big) or to disconnected 
ones (if k is too small). 

The aforementioned methods are based only on the topology of 
the network, but one could consider different approaches based 
on fiux distributions over the set of reactions, such as elementary 
modes (Schuster and Hilgetag, 1994; Schuster e? a/., 1999) that are 
minimal subnetworks working at steady state. One difficulty in 
this case is that flux-based models need stoichiometric values as 
well as a definition of the boundaries of the system under analysis, 
which are not always simple to identify, particularly in the case of 
a metabolomics experiment in which the list of discriminating 
compounds does not directly define such boundaries. Moreover, 
flux approaches are focused on reactions and are not designed to 
take into account endogenous metabolite concentrations. The 
very same metabolites may play different roles in different meta- 
bolic processes, being source in one, intermediate in a second and 
target in a third one. The inabihty and the unwillingness to tell, a 
priori, the role of the discriminating compounds in each scenario 
to be proposed is a key factor of our approach: we are interested 
not only in connecting the discriminating compounds but also in 
establishing their individual role for each scenario. 

Our approach is a subgraph extraction technique in which we 
want to find maximal directed acyclic subgraphs (DAGs) whose 
set of sources and targets are discriminating compounds. We call 



such subgraphs metabolic stories, or for short, simply stories. In 
practice, for a given set of discriminating compounds, the 
number of stories may be large. Because we do not have a 
clear criterion for choosing which of these stories is the most 
relevant, we first aim at enumerating them all. In a second 
step, we discuss ways to rank them based on how the concentra- 
tion of the discriminating compounds is observed to vary in the 
experiment. This procedure allows a good filter of the solutions, 
selecting stories that best fit the experimental data. 

2 MODELS 

2.1 Modeling metabolic stories 

In this section, we introduce the notion of story and give a 
rationale for its definition. Briefly, stories are subgraphs that 
summarize the flow of matter from a set of source metabolites 
to a set of target metabolites. The candidates to be the endpoints 
(sources or targets) of a story should belong to the set of 
discriminating compounds. To guarantee that stories will have 
at least one source and one target, we introduce the acyclicity 
constraint. These two combined constraints lead us to search for 
DAGs with sources and targets contained in the given set of 
discriminating compounds. Then, because there can be several 
paths connecting two discriminating compounds and we want 
the story to contain all these alternative paths, we impose a con- 
straint of maximality, that is, we search for maximal DAGs, in 
the sense that alternative pathways between all the nodes should 
be included, if their addition does not create cycles. In other 
words, a DAG is maximal if by adding any arc makes it not a 
DAG anymore, meaning that it contains at least one cycle. 

Our goal is to have an algorithm that enumerates all stories, 
i.e. to provide all possible scenarios that explain the observed 
transformations. Because our focus is on the relation between 
discriminating compounds, we use a representation of metabolic 
networks focused on metabolites, the so-called compound net- 
work (Lacroix et al., 2008), that is a directed graph in which 
vertices are compounds and there is an arc from a compound 
to another compound if there is a reaction that consumes the first 
to produce the second. 

More formally, we introduce a constrained version of the prob- 
lem of enumerating all maximal DAGs of a graph G 
(Schwikowski and Speckenmeyer, 2002). Let G = (B U W, £) be 
a directed graph such that IB n W = 0. We write V=UUW. 
Nodes in B are said to be black nodes and correspond to the 
discriminating compounds, whereas those in W are said to be 
white nodes. Let d^{u) and d^(u) denote, respectively, the in- 
degree and the out-degree of a node u. Node u is called a source 
if (w) = 0 and (ii)>0 and a target d^{ii) = 0 and d^(u)>0. 

A metabolic story of G is a maximal acyclic subgraph 
G' = (B U W, E') of G with W c W and E Q E and such 
that, for each node \v e W, w is neither a source nor a target 
red in G'. Maximality means that it is not possible to add other 
arcs or nodes without creating cycles, or white sources or targets. 
We denote by E(G) the set of stories of G. 

2.2 Enumerating metabolic stories 

A first step of our algorithm to enumerate E(G) is to apply 
compression operations on the input graph obtaining a more 
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compact representation, which is equivalent in terms of story 
sets. The operations are (i) white source and target removal 
that consists in removing iteratively white nodes that are either 
sources or targets, as sucli nodes cannot appear in any story; 
(ii) self-loop removal that consists in removing all arcs of the 
form (w, n): because stories are acyclic, such arcs do not appear 
in any story; (iii) forward and backward bottleneck removal, that 
consists in removing a white node v whose out-degree (respect- 
ively, in-degree) is equal to I, and directly connecting any pre- 
decessor (respectively, successor) of v to the unique successor 
(respectively predecessor) of v (without creating multiarcs). Our 
preprocessing algorithm consists in applying operations (i), (ii) 
and (iii) successively until no more white sources and targets, 
self-loops and bottlenecks are present in the graph. We call the 
resulting graph a compressed network. 

In (Acuna et al., 2012), we proposed a first method to enu- 
merate stories based on a polynomial-time algorithm to compute 
one story. This is briefly recalled in Supplementary Material SI. 
More recently we developed a much faster enumerator for stories 
based on a linear-time enumeration algorithm for non-maximal 
stories (Borassi et al, 2013) that allows us to explore the whole 
set of solutions even for genome-scale metabolic networks. This 
is the enumerator algorithm we use here. 

2.3 Scoring function 

From a formal point of view, there is no qualitative difference 
between any two stories. In this sense, whether a given discrimi- 
nating compound is a source, an intermediate node or a target in 
a story is indifferent for the enumeration process, as all possible 
scenarios satisfying the three properties given by the definition, 
namely, maximality of paths, acyclicity and source/target con- 
straint, have to be computed. 

However, in practice, the number of stories can be large and 
being able to rank them greatly facilitates their analysis. To do 
this, we propose the following score function: 

s{S) = ^ CLi{x) X oy{y) X co{x'^ y), 

where the score s{S) of a story S is the sum, for each black 
transformation x-^y, of the product of the node weights (d{x) 
and ft)(j') of the nodes .v and times the path weight o}{x-^y). 
A black transformation is defined as an arc or a simple white 
path between two black nodes. A simple white path is a simple 
path (i.e. containing no cycles) composed of only white nodes 
between two black ones. The values assigned to the node and 
path weights will depend on the data available and are thus 
perfectly suited for the integration of various omics data. For 
our analysis, we used the topology of the stories and additional 
data from the metabolomics experiments as described in more 
detail in the Section 3 (see Table 2). 

2.4 Yeast metabolic network 

For the analysis of the metabolics experiment (Madahnski et al., 
2008), we used the metabolic reconstruction of Saccharomyces 
cerevisiae s288c available in MetExplore (Cottret et al., 2010) 
(the metabolic model was built based on the YeastCyc database). 
The procedure followed is briefly described in Supplementary 
Material S2. 



3 RESULTS 

3.1 Metabolic stories to analyze metabolomics data 

To illustrate how to use our method, we concentrate on the study 
of the exposition of S.cerevisiae to the toxic cadmium {C(f^) 
reported in Madalinski et al. (2008). A widely studied metabolic 
pathway in S.cerevisiae is the one responsible for glutathione 
biosynthesis, as it is related to the detoxification process of the 
cell when exposed to high concentrations of cadmium (Fauchon 
et al, 2002; Lafaye et al, 2005; Madalinski et al., 2008). Previous 
studies demonstrated that the presence of such a metal in the 
environment has a huge impact in terms of gene expression and 
metabolism, showing that there is a strong response both at the 
metabolomic and proteomic levels. Basically, glutathione needs 
to be produced because it is a thiol metabolite linked to the 
detoxification of cadmium through a process called chelation 
(Li et al., 1997). Plants are the natural biotope of S.cerevisiae 
and it is known that they are able to tolerate cadmium and other 
metals up to 1 % of their dry weight, which is believed to provide 
defense against herbivores and pathogenic microorganisms 
(Fauchon et al., 2002). This exposition to cadmium in natural 
conditions provides a reason for yeast to keep a detoxification 
pathway. However, the biosynthesis of glutathione requires high 
quantities of sulfur. To save sulfur, there is a replacement of 
abundant sulfur-rich proteins related to other metabolic pro- 
cesses by sulfur-depleted isozymes (i.e. other enzymes that have 
the same function). Such is the case for the enzymes pyruvate 
decarboxylase (Pdclp), enolase (Eno2p) and aldehyde dehydro- 
genase (Ald6p) that are replaced by isozymes containing less 
sulfur amino acids (i.e. methionine and cysteine) that are mobi- 
lized in the glutathione pathway and are less available for protein 
synthesis (Fauchon et al., 2002). This response affects a large 
portion of the metabolic network and represents the mechanism 
used by the cell to survive under this specific stress condition. 
Sulfur limitation conditions slow down the growth rate but do 
not induce this same sulfur-sparing response (Fauchon et al., 
2002). A schema of the known glutathione biosynthesis meta- 
bolic pathway is presented in Figure 1 . 

The metabolic network used for this analysis (see the Section 2 
for a description) contains 600 metabolites and 949 arcs. 
Madalinski et al. (2008) identified a list of 24 metabolites 
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Fig. 1. Glutathione biosynthetic pathway. Compounds in bold are dis- 
criminating in Madalinski et al. (2008) and are involved in the synthesis of 
glutathione. Source: adapted from Figure 1 in Lafaye et a/. (2005) 
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whose concentration significantly changed after cadmium expos- 
ure, shown in the table given in Supplementary Material S3. 

It is important to notice here that identification of the metab- 
olites that have changed their concentration is based on a min- 
imum of two orthogonal criteria relative to an authentic 
compound analyzed under identical experimental conditions: 
retention time and mass spectrum or retention time and 'H 
nuclear magnetic resonance (NMR) spectra, accurate mass and 
tandem mass spectra or accurate mass and related isotopic clus- 
ters or 'H and/or '^C NMR with 2D NMR spectrum (Sumner 
et al., 2007). However, many metabolites are not commercially 
available and many of them may require tedious and expensive 
chemical synthesis, which often hampers their definitive metab- 
olite identification. Thus, such compounds remain putatively 
annotated or characterized. 

We decided to perform two analyses to explore the effect of 
cadmium exposure on S.cerevisiae cells. We first enumerated 
metabolic stories using a set of black nodes restricted to the 
measured metabolites that are known to participate to the bio- 
synthesis of glutathione. The idea is to check whether our 
method is able to recover one or more stories that correspond 
to the known metabolic pathway. In a second step, we enumer- 
ated metabolic stories using the entire list of 24 discriminating 
compounds identified in the metabolomics experiments. In this 
case, the goal is to analyze both the response of glutathione 
biosynthesis, but also the potential response of other pathways 
and the side effects of these responses in the rest of the network. 

3.2 First analysis: local response to cadmium exposure, 
biosynthetic pathway of glutathione 

We first consider the aforementioned metabohc pathway directly 
involved in cadmium detoxification, namely, the glutathione 
biosynthetic pathway, to enumerate stories and check whether 
we are able to recover one that fits our current knowledge of the 
biological process. We thus selected as black nodes for this first 
analysis only the metabolites that were measured in the experi- 
ment (Madalinski et al., 2008) and that are also known to par- 
ticipate in the glutathione biosynthetic pathway (Fauchon et al., 
2002). These eight compounds are presented in the table given in 
Supplementary Material S3 with the third column marked as 
'yes': glutathione, O-acetylhomoserine, methionine, glutamate, 
glutamylcysteine, serine, glycine and cystathionine. 

3.2.1. Compres.sed network A first practical result that follows 
directly from the properties of our definition of stories is the com- 
pressed representation of the subnetwork in which all interactions 
between the discriminating compounds are captured. The com- 
pression is obtained in two steps. In the first step, we extract all 
biologically relevant routes between the black nodes. In our case, 
we computed lightest paths between black nodes using the out- 
degree of a node as its weight, which has been defended as being 
more biologically sound than a simple shortest-path approach 
(Blum and Kohlbacher, 2008). The second step is to apply the 
four compression rules that were previously described briefly (see 
Section 2) and that are fully detailed in Acuna et al. (2012). 

The compressed network obtained for the reduced set of black 
nodes contains 10 nodes and 25 arcs, i.e. represents >98% of 
compression in terms of nodes and >97% in terms of arcs with 
respect to the original input size of the S.cerevisiae metabolic 




L-GAMMA-GLUTAMYLCYSTEINE 

Fig. 2. The compressed network computed considering as black nodes 
the eight compounds of the table in Supplementary Material S3 marked 
as present in the glutathione biosynthetic pathway. Green nodes are the 
ones whose concentration significantly increased in the presence of cad- 
mium, whereas the red are the ones whose concentration significantly 
decreased. The diameter of the nodes is proportional to the concentration 
change. Solid arcs represent single reactions connecting the two com- 
pounds, whereas dashed ones correspond to a chain of at least two 
reactions 

network. The resulting compressed network is shown in 
Figure 2. This compression ratio is spectacular, as it is now 
much easier to visually inspect the network in which we can 
highlight the metaboUtes of interest. This type of visualization 
is, therefore, already a result in itself which can readily be used 
to start proposing causal explanations for the changes of metab- 
olite concentrations. To facihtate this, we further enrich this 
representation with the information on the direction of the 
change of concentration (whether the metabohte concentration 
increased or decreased) and the intensity of this change. Of the 8 
metaboUtes considered, 3 had a significant increase of their con- 
centration (reduced glutathione, cystathionine and glutamylcys- 
teine), whereas the other 5 had a significant decrease of their 
concentration (methionine, O-acetylhomoserine, glutamate, 
serine and glycine). From now on, we will denote the first set 
as green nodes and the second set as red nodes. The other nodes, 
whose concentration did not change significantly, will remain 
identified as white nodes. We notice that this distinction between 
red and green nodes is only possible for appUcations where two 
conditions are compared. This is the case we consider in this 
article. When more than two conditions are compared, our meth- 
odology still applies, keeping the terminology of black and white 
nodes. We can produce the compressed network and enumerate 
the stories. The ranking scheme described later would, however, 
need to be adapted. Finally, during the preprocessing of the 
network, some paths are compressed into a single arc. To distin- 
guish between reactions linking two compounds and these com- 
pressed paths, we used solid lines for the former and dashed lines 
for the latter. Importantly, the compression of the network is 
lossless as it is easily reversible, for instance if we need to have 
access to the full path of white nodes that indirectly link two 
black nodes. Interestingly, in practice, although most white 
nodes can be compressed, some remain. Their compression 
would prevent us from being able to enumerate the full set of 
stories. These compounds, although not detected as discriminat- 
ing, seem to also play an important role in the studied process as 
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they are at the crossroads between at least two possible routes 
between discriminating compounds. 

3.2.2. Enumerating and .scoring the .stories The compressed 
network is already a result per se, but its visual inspection 
remains difficult; the many cycles it contains allow for a reading 
of the flow of matter in many possible directions, thereby sug- 
gesting several possible causal scenarios. Therefore, we go one 
step further in the analysis and enumerate the metabolic stories. 
In this analysis, there are a total of 222 stories. 

With the aim of classifying the set of computed stories, we 
have to define how to assign values to the node and arc weights 
needed by our score function scheme (see Section 2). 

There are basically four kinds of interactions that may be 
observed in a metabolic story (see Table 1). In the following 
proposal for causal interpretation of each type of arc, we will 
make the simplifying assumption that each arc is independent 
from the other ones. In this context, an arc linking a red node to 
a green node will correspond to the consumption of the red node 
to the benefit of the green node. If we focus solely on this arc, this 
can only be explained by an activation of the enzyme catalyzing 
the reaction linking the two nodes. On the other hand, an arc 
linking a green node to a red node can be interpreted as the 
inhibition of the enzyme catalyzing the reaction linking the two 
nodes. Finally, an arc linking two red nodes can be explained 
by a domino effect. The simple fact that the substrate concentra- 
tion decreases causes the product concentration to decrease. 
This domino effect does not require any enzyme change. It just 
corresponds to a change in concentration that propagates. The 
case of green to green arcs can be explained by a similar effect. 
We additionally need to assume that the enzyme is not present in 
a limiting amount. 

We remind that in this section, our approach is local and 
focuses on single transformations. We always favor the most 
parsimonious explanation (the one with fewest enzyme changes), 
but, in practice, other plausible explanations could be proposed 



Table 1. Biological interpretation for arcs in a story 



Arc 


To red 


To green 


From red 


Concentration change 


Enzyme activation 


From green 


Enzyme inhibition 


Concentration change 



for each arc. Importantly, the notion of enzyme activation 
or inhibition as used in this article should be understood in a 
general sense as it captures allosteric regulation of the enzyme 
or transcriptional regulation of the gene(s) encoding the enzyme. 
In the application considered here, the time separating the 
measurements (before and after exposure to cadmium) is large 
enough to allow to interpret enzyme activations as a change 
in their concentration through a transcriptional response. Our 
methodology also applies when the time separating the measure- 
ments is shorter. In the following, we propose three ranking 
schemes for stories. In each of them we favor one type of arc, 
which means that we look for the stories with a large number of 
arcs of this type. Even if the individual explanation of each arc is 
not necessarily correct, the overall optimization of the total 
number of each arc type makes intuitive sense, and we show 
that in practice it enables to explore efficiently the space of all 
stories. 

3.2.3 Tliree .scoring .schemes Let us start by defining the arc 
weights that are restricted to being — 1 , 0 or + 1 . The first scoring 
scheme privileges stories where green nodes are preferentially 
targets in the story (i.e. are produced) and, on the other hand, 
red nodes are preferentially sources in the story (i.e. are con- 
sumed). Let us call this score function enzyme-activation-first, 
as it should privilege arcs from red to green nodes and penaUze 
the inverse as shown in Table 2a. Another possibility is to classify 
first stories in which the concentration change responses are 
privileged as shown in Table 2b. Let us call this score function 
concentration-change-first, as it should privilege arcs from red to 
red nodes or green to green nodes. Finally, we may define a score 
function in which we privilege arcs going from green nodes to red 
nodes; in such a case these arcs represent enzyme inhibition, 
as shown in Table 2c. Let us call this score function enzyme- 
inhibition-first. 

Once an arc weighting scheme has been chosen, we define the 
node weights. For our experiments, we define the value a>{.x) 
for a given node .v as its normalized inten.sity ratio, which is its 
intensity ratio divided by the maximum intensity ratio observed 
in the experiment (if v is a green node) or the minimum intensity 
ratio observed in the experiment divided by the intensity ratio of 
the node (if v is a red node). An example is given in the figure in 
Supplementary Material S4. 

3.2.4 Application to cadmium .stress response in yeast Using the 
three presented score functions, we were able to rank the 222 



Table 2. Weights for different score functions of a story 



(a) Enzyme activation first (b) Concentration Change first (c) Enzyme inhibition first 



Outgoing arcs Outgoing arcs Outgoing arcs 



Arc To red To green Arc To red To green Arc To red To green 

From red 0 1 From red 1 — 1 From red 0 — I 

From green — 1 0 From green — 1 1 From green 1 0 



Note: Table exhibiting the arc weights for interactions between green and red nodes used tor computing the score of a story in the context of a metabolomics experiment: (a) 
weights used to privilege enzyme activation, (b) weights used to privilege concentration change and (c) weights used to privilege enzyme inhibition. 
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stories previously computed and identify the top scoring stories 
for each one of tlie three functions. Figure 3a shows one of tlie 
six optimal stories according to the enzyme-activation-first 
scheme, Figure 3b shows the single optimal story according to 
the concentration-change-first scheme and Figure 3c shows one 
of the two optimal stories found according to the enzyme inhib- 
ition first scheme. The goal of this first analysis is to try to iden- 
tify stories that could correspond to the current knowledge on 
the response of yeast to cadmium exposure, i.e. a story that 
corresponds to the glutatliione biosynthetic pathway previously 
presented in Figure 1. Among the top scored stories, the one 
given by the enzyme-activation-first score function (see 
Figure 3a) agrees well with the current knowledge of yeast re- 
sponse to cadmium. The discussion in Madalinski et al. (2008) 
presents as a result a flux corresponding to the detoxification of 
cadmium by glutathione, explaining that the levels of metabolites 
involved in the glutathione biosynthesis pathway (homocysteine, 
cystathionine, glutamyl-cysteine and glutathione itself) were 
increased following cadmium exposure, which is the same flow 
of matter preserved in the story shown in Figure 3a. The story 
selected by the concentration-change-first score function, shown 
in Figure 3b, preferentially preserves arcs between two nodes of a 
same color. The idea is that an increase (or a decrease) of con- 
centration of a given metabolite could be a side effect of the 
increase (or decrease) of another one. The goal is to minimize 
the number of arcs that suggest some enzyme activity change, i.e. 
arcs that involve red and green nodes. Interestingly, the story 
that scores best with this ranking scheme does not fit with the 
current knowledge of the response to cadmium exposure. This 
means that, in principle, there exists a scenario that uses fewer 
red to green or green to red arcs than the true response (and 
therefore fewer enzyme changes), but this scenario is not the 



(a) Enzyme Activation (b) Concentration Change (c) Enzyme Inhibition 
Score: 1.252 Score: 2.118 Score: 0.762 




Fig. 3. The best stories generated for our analysis taking into account 
only the metabohtes known to be present in the glutathione biosynthesis 
and whose concentration significantly changed after cadmium exposure. 
Green nodes are the ones whose concentration significantly increased in 
the presence of cadmium, whereas the red nodes are the ones whose 
concentration significantly decreased. The diameter of the nodes is pro- 
portional to the concentration change. Solid arcs represent single reac- 
tions connecting the two compounds, whereas dashed ones correspond to 
a chain of at least two reactions 



one taken in practice. There can be a number of reasons why 
this optimal scenario is not taken. Although any enzyme can, in 
principle, be activated or inhibited, in practice, some have more 
degrees of freedom. In addition, some reactions annotated 
as reversible in general, happen to have one clearly favored 
direction in specific conditions. Finally, the story presented in 
Figure 3c preferentially preserves green to red arcs that could 
represent an enzyme inhibition. Again, this scenario does not 
fit with the current knowledge on yeast response to cadmium, 
which indicates that the response is probably not based mostly 
on enzyme inhibition. 

3.2.5 Anthologies In Figure 3a, a story with score 1.252 for the 
enzyme-activation-first score function is presented. However, 
there are other five stories that achieved the same score. 
These tied optimal stories may be combined into a single graph 
representation to ease the analysis of their differences as 
presented in Figure 4. A unique graph representing the union 
of several different stories is called an anthology. Notice that 
differently from stories, which are maximal DAGs, an anthology 
contains at least one cycle. The sources and targets (sinks) of an 
anthology (if any remains) are, however, black nodes only, as 
with stories. In this case, the equivalent stories are due to the fact 
that serine, glycine and oxalacetic acid are all interconnected by 
reversible paths. 

3.3 Second analysis: global response to cadmium exposure 

For the second analysis, we decided to explore the global 
response to cadmium exposure and we considered all 24 
discriminating compounds. One of them, pyroline-hydroxy- 
carboxylate, was eliminated when computing the lightest paths 



GLUTAMATE 




GLUTATHIONE 



Fig. 4. The anthology combining the six maximal stories obtained with 
the enzyme-activation-first score function. Notice that the anthology pre- 
serves the flow of matter observed in the pathway known to be involved 
in cadmium detoxification by the yeast. Once more, green nodes are the 
ones whose concentration significantly increased in the presence of cad- 
mium, whereas the red are the ones whose concentration significantly 
decreased. The diameter of the nodes is proportional to the magnitude 
of the concentration change as measured by the intensity ratio of 
the compound. Solid arcs represent single reactions connecting the two 
compounds, whereas dashed ones correspond to a chain of at least two 
reactions 
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Fig. 5. The compressed network computed for the whole list of 
discriminating compounds of the table in Supplementary Material 3 
and the metabolic network of the yeast strain s288c. Green nodes are 
the ones whose concentration significantly increased in the presence of 
cadmium, whereas the red are the ones whose concentration significantly 
decreased. The diameter of the nodes is proportional to the concentration 
change. Solid arcs represent single reactions connecting the two 
compounds, whereas dashed ones correspond to a chain of at least two 
reactions 

between all pairs of black nodes, as it was part of a small dis- 
connected component of the original input graph, rnost probably 
due to missing information in the metabolic network reconstruc- 
tion as the metabolite was present in the metabolome of the 
strain. The computed compressed network contains 34 nodes 
and 76 arcs, i.e. a compression of 94% in terms of nodes and 
92% in terms of arcs. The resulting compressed network is 
shown in Figure 5. Again, this compressed network is already 
a result per se as it enables to visualize jointly all the possible 
ways of explaining the flow of matter through the network. 
However, in this case again, and probably even more than 
before, the readability is comphcated, and we, therefore, go 
one step further and compute all stories. 

This time, the number of stories is much larger: there 
are 3 934160 in total. In fact, this exact number could only be 
obtained with the recent improvement we proposed in Borassi 
et al. (2013). Before that, the computation would not end in 
reasonable time and we only had an approximate number. In 
our initial analysis, the score function that selected a story that 
best fitted the targeted known metabolic pathway of the gluta- 
thione biosynthesis was the enzyme-activation-first scoring 
scheme. For this reason, we used it also to analyze the larger 
dataset produced in this second analysis, obtaining 20 maximal 
stories presented as an anthology in Figure 6. Considering all the 
metabolites that were measured in the experiment as black nodes 
in our method allows us to have a more global view of the 
organism's response to cadmium exposure. This enables to 
explore whether the other identified paths, apart from the ones 
involved in the glutathione pathway, are part of this response or 
simply side effects of the sulfur redirection, as further discussed 
in the next section. 

3.4 Analytic tools 

All the compressed networks, stories and anthologies presented 
in tills section were coinputed using our algorithin called Touche 
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L-GAMMA-GLUTAMYLCYSTEINE 



L-CYSTEYNILGLYCINE;GLUTATHIONE^^ 

Fig. 6. Anthology corresponding to the 20 stories with the maximal score 
computed for the experiment on yeast s288c exposed to cadmium. Red 
nodes correspond to metabolites whose concentration decreased and 
green nodes to those whose concentration increased in the metabolomics 
experiment. White nodes have their concentration unchanged or it could 
not be measured. The diameter of the nodes is proportional to the con- 
centration change. Solid arcs represent single reactions connecting the 
two compounds, whereas dashed ones correspond to a chain of at least 
two reactions. The arc's thickness represents the frequency of the arc in 
the stories making up the anthology, whereas gray arcs correspond to 
reactions known to be part of the response to cadinium 



(Borassi et al., 2013). For visualization and analytical purposes, 
we used Cytoscape (Shannon et al., 2003), which is a software for 
network visualization, enriched with a plug-in we developed to 
enable loading, visualizing and inspecting the three aforemen- 
tioned objects (coinpressed networks, stories, anthologies) 
inside Cytoscape. The plug-in applies the given visual properties 
corresponding to a metabolomics experiinent (e.g. colour and 
diameter of the nodes, the thickness of an arc corresponding to 
the frequency of the arc in the stories composing the anthology) 
and allows a zoom-in in the dashed arcs, exhibiting the paths 
connecting the two nodes. Both Touche and the Cytoscape plug- 
in are available on demand. 

4 DISCUSSION 

Focusing specifically on the biological application presented in 
the previous section, we may see that exploring the topological 
properties of the stories through the preprocessing of the input 
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network creates a compressed network that captures all the 
relationships between the discriminating compounds in a much 
smaller graph than the whole network. This already allows a 
visual inspection of the observed variations, which is rather dif- 
ficult, if not impossible, in the entire network. On the other hand, 
one may easily highlight in a whole metabolic pathway map 
those metabolites whose concentration were detected as having 
changed using the YeastCyc database. However, because the 
pathways are presented as disconnected, it is not possible to 
follow a path that traverses several pathways (see Figure in 
Supplementary Material S5). To demonstrate the utility of our 
approach, we used data from Madahnski et al. (2008) in which 
the authors monitor changes in metabolite concentration as a 
response of the yeast S.cerevisiae to cadmium, a toxic chemical. 
The aim of tliis study is to analyze the global response of an 
organism to a stress. Using only the metabolomics experiment 
data to choose the discriminating compounds and to rank the 
stories, we are able to obtain stories that correspond well to the 
current biological understanding of the system under study, as 
well as to propose new alternatives that could serve as a basis for 
further experimental validations. Because regulatory information 
and quantitative information are not needed by the method, this 
allows it to be used for metabolic network reconstructions even 
when they are not well refined and where these additional infor- 
mations may be unavailable or incomplete. 

The method herein presented allows visual inspection of a set 
of discriminating compounds (either local or broader) from 
metabolomics data in the compressed network, stories and/or 
anthology with no a priori selected pathways. The metabolic 
stories may be ranked in different ways depending on which 
interpretation one wishes to emphasize for the causal link 
between two affected metabolites: enzyme activation, enzyme 
inhibition or domino effect on the concentration changes of sub- 
strates and products. Equally probable stories under any selected 
ranking scheme can be further grouped into a single anthology 
that summarizes in a single subnetwork all equivalently plausible 
alternative stories. 

4.1 First analysis: local response to cadmium exposure 

The first analysis performed aimed at locally inspecting the yeast 
response to cadmium exposure limited to the biosynthetic path- 
way of glutathione, given in Figure 1 . Of the 222 stories found, 
the ones favoring enzyme activation were clearly closer to our 
current understanding of this response, where an increased 
sulfur flux passes through homocysteine, cystathionine, cysteine 
and glutamyl-cysteine to yield high levels of glutathione 
(Madalinski et al., 2008). This same flow of matter is captured 
in the anthology combining the six best stories under this scoring 
scheme, shown in Figure 4. 

Interestingly, we show that there exists one scenario that, in 
principle, uses fewer enzymes to explain the observed changes in 
concentration. This is the scenario that favors concentration 
changes, shown in Figure 5b. However, this scenario does not 
match the current knowledge of the main pathway of yeast 
response to cadmium. In fact, it even uses some reactions in the 
opposite direction. Because these reactions are annotated as 
reversible, they can be taken in both directions, at least in theory, 
and this explains that we found these alternative stories. Those are 



scenarios that are a priori possible. They are not necessarily 
'chosen' in practice, possibly because the reactions are only 
reversible under some conditions that are not met in this experi- 
ment. Unfortunately, the precise conditions under which a reac- 
tion is reversible are in general not well known. The addition 
of such knowledge would for sure enable to reduce substantially 
the number of stories we output, as a large part of the combina- 
torial explosion we observe comes from these 'cycles'. Conversely, 
understanding why some possible scenarios are not taken in prac- 
tice could help to better annotate the reversibility of reactions. 

From the list of discriminating compounds identified in 
Madalinski et al. (2008), the ones that are involved in the gluta- 
thione biosynthetic pathway (as shown in bold in Fig. 1) are as 
follows: 0-acetylhomoserine, methionine, serine, cystathionine, 
glutamate, y-glutamyl-cysteine, glycine and glutathione. All of 
them are present in the compressed networks, stories and 
anthologies herein presented (Figs 2-6). As concerns cysteine 
and homocysteine, they were either not measured or not discri- 
minating in Madalinski et al. (2008), thus in our analysis they 
appear as white nodes and may be compressed inside an arc 
(dashed arcs in Figs 2-6). Cysteine is included in the dashed 
arc linking cystathionine and L-gamma-glutamyl-cysteine that 
is its expected place based on the biosynthetic pathway of gluta- 
thione. Homocysteine is represented as a white node in all fig- 
ures. Because it is at a crossroads between three black nodes 
(cystathione, methionine and O-acetylhomoserine), it could not 
be compressed. 

Interestingly, the compounds involved in the methyl cycle, 
which is a sulfur salvage pathway (see Fig. 1), were not recovered 
in the highest score stories found in our first analysis. The reason 
is that the lightest path found between methionine and 
cystathione in that analysis passed through the reaction catalyzed 
by the enzyme homocysteine 5-methyltransferase (Mhtl), which 
is assigned as reversible in the YeastCyc database (Caspi et al., 
2010) and in our data. This enzyme was described as recycling 
S-adenosylmethionine (AdoMet) to metliionine (Thomas et al., 
2000). 

4.2 Second analysis: global response to cadmium exposure 

This more local view of the behavior of the metabolic network of 
yeast in this stress condition may be contrasted with the second 
analysis, where the whole list of discriminating compounds was 
considered. The anthology combining the 20 best stories under 
the scoring scheme favoring enzyme activation is presented in 
Figure 6, where the reactions corresponding to the glutathione 
biosynthesis are highlighted in gray. This is a strong point of our 
method, as it allows exploring alternative but close scenarios 
through the analysis of these (and possibly other) stories 
altogether, which might provide new insights on the underlying 
processes that took place under the given conditions. 

Among the 35 nodes presented in this anthology, eight have 
sulfur in their chemical structure: AdoMet, y-glutamylcysteine, 
5-methylthioadenosine (MTA), O-acetyl-L-homoserine, cystei- 
nylglycine, glutathione, cystathionine and L-methionine. 
Among these sulfured metabolites, the only one that is not 
involved in the glutathione biosynthesis is MTA, which is instead 
involved in the MTA cycle, a sulfur salvage pathway (Thomas 
and Surdin-Kerjan, 1997). This recycles AdoMet to methionine 
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through a chain of reactions, whereas Mhtl (mentioned earlier in 
the text) can also perform it in one step, which is important for 
controlling the intracellular ratio between these two metabolites 
(Thomas et al., 2000) Although there is a redirection of sulfur 
flux to glutathione biosynthesis after cadmium exposure, the 
levels of MTA increased as well as those of arginine, which is 
a precursor for MTA. The metabolites in the methyl cycle are 
recovered, with the white nodes AdoMet and homocysteine pre- 
sent in the anthology and the metabolite 5'-adenosyl-homocyst- 
eine compressed into the arc between them. We have previously 
tried to link arginine to sulfur metabolism by emphasizing that it 
is a precursor of spermidine, a polyamine metabolite that is itself 
involved in the biosynthesis of MTA, a metaboUte associated 
with the methyl cycle and whose levels are increased after cad- 
mium exposure (Madalinski et al., 2008). However, experimental 
data lacked to support this assumption. By using the metabolic 
stories based approach, the increased levels of arginine are linked 
to decreased concentrations of citrulhne, which has not been 
formally identified in our experimental conditions, and which 
is itself linked to glutamate. Besides, citruUine was identified as 
a discriminating compound in Madalinski et al. (2008), but was 
only indicated as putative, requiring more analysis for final iden- 
tification. Our results seem to confirm that citruUine was cor- 
rectly identified. This emphasizes the relevance of using this kind 
of approach to generate biological hypotheses that have to be 
further investigated by biologists. Of note, such a link between 
arginine and sulfur metabolism has been noticed in other organ- 
isms (Sekowska et al., 2001) and links between nitric oxide and 
polyamines have been established with cadmium toxicity in 
wheat roots (Groppa et al., 2008). Furthermore, this global 
view of the discriminating compounds links the sulfur metabol- 
ism to non-sulfur amino acids and other metabolites through 
intermediates of the central metabolism. The amino acids that 
are precursors to the glutathione synthesis have their levels 
reduced as expected, whereas most of the others increased. 
This agrees with the fact that global protein synthesis rapidly 
drops after cadmium exposure (Lafaye et al., 2005), reducing 
the consumption of amino acids not directly connected to gluta- 
thione synthesis. 

4.3 Perspectives 

We presented a generic method that enables to analyze metabo- 
lomics data. This method requires simple input and can be 
applied to a wide variety of situations. Clearly, the results of 
the method can be improved with the addition of other types 
of data. For instance, the use of carbon tracing experiments 
could help in focusing directly on the stories that are involved 
in the response to the stress condition, instead of considering the 
set of all possible stories. Besides, we assumed that the set of 
discriminating compounds did not need to be questioned. 
However, these are predicted based on the analysis of peaks in 
a spectrum. We remark that extracting such information is in 
itself a bioinformatics challenge. Therefore, a possible extension 
of the method could be to take into account noisy data, i.e. to 
deal with a level of confidence for the roles of discriminating 
compounds and non-discriminating compounds. From the mod- 
eling point of view, we enforce that each story corresponds to a 
fiow of matter by the acyclicity constraint. We could relax this 



constraint by allowing internal cycles, and therefore computing, 
for each combination of sources and targets, a single story. This 
will lead to a completely different model and is beyond the scope 
of this article. 
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